Midterm

Data Visualization (STAT 302)

Author

INSTRUCTIONS

Overview

The midterm attempts to bring together everything you have learned to date. You’ll be asked to replicate a series of graphics to demonstrate your skills and provide short descriptions/explanations regarding issues and concepts in ggplot2.

You are free to use any resource at your disposal such as notes, past labs, the internet, fellow students, instructor, TA, etc. However, do not simply copy and paste solutions. This is a chance for you to assess how much you have learned and determine if you are developing practical data visualization skills and knowledge.

Datasets

The datasets used for this midterm are stephen_curry_shotdata_2014_15.txt, ga_election_data.csv, and ga_map.rda. You may also need the nbahalfcourt.jpg image.

Below you can find a short description of the variables contained in stephen_curry_shotdata_2014_15.txt:

  • GAME_ID - Unique ID for each game during the season
  • PLAYER_ID - Unique player ID
  • PLAYER_NAME - Player’s name
  • TEAM_ID - Unique team ID
  • TEAM_NAME - Team name
  • PERIOD - Quarter or period of the game
  • MINUTES_REMAINING - Minutes remaining in quarter/period
  • SECONDS_REMAINING - Seconds remaining in quarter/period
  • EVENT_TYPE - Missed Shot or Made Shot
  • SHOT_DISTANCE - Shot distance in feet
  • LOC_X - X location of shot attempt according to tracking system
  • LOC_Y - Y location of shot attempt according to tracking system

The ga_election_data.csv dataset contains the state of Georgia’s county level results for the 2020 US presidential election. Here is a short description of the variables it contains:

  • County - name of county in Georgia
  • Candidate - name of candidate on the ballot
  • Election Day Votes - number of votes cast on election day for a candidate within a county
  • Absentee by Mail Votes - number of votes cast absentee by mail, pre-election day, for a candidate within a county
  • Advanced Voting Votes - number of votes cast in-person, pre-election day, for a candidate within a county
  • Provisional Votes - number of votes cast on election day for a candidate within a county needing voter eligibility verification
  • Total Votes - total number of votes for a candidate within a county

We have also included the map data for Georgia (ga_map.rda) which was retrieved using tigris::counties().
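The wrangling code later in this midterm refers to lower-case, snake_case column names (for example period and absentee_by_mail_votes), while the descriptions above list names such as PERIOD and Absentee by Mail Votes. One way to get from one form to the other is janitor::clean_names(). The sketch below applies it to a made-up one-row table rather than the real file; whether the course materials used this function is an assumption.

```r
library(janitor)

# hypothetical one-row stand-in for ga_election_data.csv (values are made up)
toy_raw <- data.frame(
  County = "Example",
  Candidate = "Candidate A",
  `Election Day Votes` = 10,
  `Absentee by Mail Votes` = 20,
  `Advanced Voting Votes` = 30,
  `Provisional Votes` = 0,
  `Total Votes` = 60,
  check.names = FALSE  # keep the spaces in the original column names
)

# convert every column name to lower-case snake_case
toy_clean <- clean_names(toy_raw)
names(toy_clean)
```

If you import the data some other way, check names() on your own objects before running the provided code so the column names line up.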

Exercise 1

Using the stephen_curry_shotdata_2014_15.txt dataset, replicate, as closely as possible, the graphics below (2 required, 1 optional/bonus). After replicating the graphics, provide a summary of what they indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate, and how these relate/compare across distance and game time (i.e. across quarters/periods).

Plot 1

Hints:

  • Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Use minimal theme and adjust from there
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 12 & 14
Code
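# data prep
library(dplyr)  # provides %>% and mutate()

# recode period (1-5) as a factor with quarter labels; 5 is labelled overtime
steph_curry <- steph_curry %>%
  mutate(
    period = factor(
      period,
      levels = c(1, 2, 3, 4, 5),
      labels = c("Q1", "Q2", "Q3", "Q4", "OT")
    )
  )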


Plot 2

Hints:

  • Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Use minimal theme and adjust from there
  • Useful hex colors: "#5D3A9B" and "#E66100"
  • No padding on vertical axis
  • Transparency is being used
  • annotate() is used to add labels
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.04, 0.07, 0.081, 0.25, 3, 12, 14, 27


Plot 3

Important

Plot 3 is required for graduate students, but is optional for undergraduate students.

Hints:

  • Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Colors used: "grey", "red", "orange", "yellow" (don’t have to use "orange", you can get away with using only "red" and "yellow")
  • To top code so 15+ is the highest value, you need to set the limits in the appropriate scale while also setting the na.value to the top color
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.7, 5, 12, 14, 15, 20
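The top-coding hint above can be illustrated on toy data. This is a minimal sketch, not the solution: the data frame, the use of geom_tile(), and the two-color gradient are all assumptions chosen only to show how limits and na.value interact. Values above the upper limit fall outside the scale, are treated as missing, and are then drawn in the na.value color.

```r
library(ggplot2)

# made-up counts; the last value (40) exceeds the upper limit of 15
toy <- data.frame(x = 1:5, y = 1, n = c(1, 5, 10, 15, 40))

top_coded <- ggplot(toy, aes(x = x, y = y, fill = n)) +
  geom_tile() +
  scale_fill_gradient(
    low = "grey", high = "red",
    limits = c(0, 15),                  # anything above 15 is out of bounds
    breaks = c(0, 5, 10, 15),
    labels = c("0", "5", "10", "15+"),  # label the top break as 15+
    na.value = "red"                    # out-of-bounds values get the top color
  )
```

Printing top_coded shows the tile for n = 40 in the same color as the tile for n = 15.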
Code
# importing image of NBA half court
court <- grid::rasterGrob(
  jpeg::readJPEG(
    source = "data/nbahalfcourt.jpg"),
  width = unit(1, "npc"), 
  height = unit(1, "npc")
)

# plot
ggplot() +
  annotation_custom(
    grob = court,
    xmin = -250, xmax = 250,
    ymin = -52, ymax = 418
  ) +
  coord_fixed() +
  xlim(250, -250) +
  ylim(-52, 418)


Summary

Provide a summary of what the graphics above indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).


Exercise 2

Using the ga_election_data.csv dataset in conjunction with the mapping data ga_map.rda, replicate, as closely as possible, the graphic below. Note that the graphic is composed of two plots displayed side-by-side. Both plots use the same shading scheme (i.e. scale limits and fill options).

Background Information: Holding the 2020 US Presidential election during the COVID-19 pandemic was a massive logistical undertaking. Voter engagement was extremely high, which produced a historically high voting rate. Voting operations, headed by states, ran very smoothly and encountered few COVID-19 related issues. The state of Georgia did a particularly good job at this by encouraging its residents to use early voting. About 75% of votes in a typical county were cast early! Ignoring county boundaries, about 4 in every 5 voters (80%) in Georgia voted early.

While it is clear that early voting was the preferred option for Georgia voters, we want to investigate whether one candidate/party utilized early voting more than the other. We are focusing on the two major candidates. We created the graphic below to explore the relationship between voting mode and voter preference, which you are tasked with recreating.

After replicating the graphic, provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic?

Hints:

  • Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Make two plots, then arrange plots accordingly using patchwork package
  • patchwork::plot_annotation() will be useful for adding graphic title and caption; you’ll also set the theme options for the graphic title and caption (think font size and face)
  • ggthemes::theme_map() was used as the base theme for the plots
  • scale_*_gradient2() will be helpful
  • Useful hex colors: "#5D3A9B" and "#1AFF1A"
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0.5, 0.75, 1, 10, 12, 14, 24
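The patchwork hints above can be sketched with two placeholder plots. This is an illustration only: the plots, title, caption, and font sizes are all made up and are not the values used in the target graphic.

```r
library(ggplot2)
library(patchwork)

# two placeholder plots standing in for the two maps
left_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
right_plot <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# `+` places the plots side by side; plot_annotation() adds text for the
# whole graphic, and its theme argument styles that text
combined <- left_plot + right_plot +
  plot_annotation(
    title = "Placeholder title",
    caption = "Placeholder caption",
    theme = theme(
      plot.title = element_text(size = 14, face = "bold"),
      plot.caption = element_text(size = 10)
    )
  )
```

Theme settings passed to plot_annotation() apply to the overall title and caption, not to the individual plots, so each map still needs its own theme adjustments.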
Important

Add comments to the code below where indicated. Each added comment should concisely describe what the following line(s) of code do in the data wrangling process.

Code
# data
ga_graph <- ga_dat %>% 
  # ADD COMMENT !!!!
  mutate(
    prop_pre_eday = (absentee_by_mail_votes + advanced_voting_votes) / total_votes
  ) %>% 
  # ADD COMMENT !!!!
  select(-contains("_vote")) 

# biden map data
biden_map_data <- ga_map %>% 
  # ADD COMMENT !!!!
  left_join(
    ga_graph %>% 
      filter(candidate == "Joseph R. Biden"),
    by = c("name" = "county")
  )

# trump map data
trump_map_data <- ga_map %>% 
  # ADD COMMENT !!!!
  left_join(
    ga_graph %>% 
      filter(candidate == "Donald J. Trump"),
    by = c("name" = "county")
  )

# biden plot

# trump plot

# final plot


Summary

Provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic?

Exercise 3

Part 1

In 3-5 sentences, describe the core concept/idea and structure of the ggplot2 package.


Part 2

Describe each of the following:

  1. ggplot()
  2. aes()
  3. geoms
  4. stats
  5. scales
  6. theme()


Part 3

Explain the difference between using geom_point(aes(color = VARIABLE)) and using geom_point(color = VARIABLE).